4. MySQL tables

This section describes the MySQL tables used by locust and their columns. The table design was borrowed from ASPseek.

wordurl

This table keeps the database dictionary.

word: The word itself, in Unicode.
word_id: Numerical handle of the word.

urlword

This table keeps information about all encountered URLs that match the conditions specified in the configuration files, whether or not they have been indexed yet.

ID of URL.
site_id: ID of site, refers to sites.site_id.
deleted: Set to 1 if server returned an error.
url: URL itself.
next_index_time: Time of next indexing in seconds from UNIX epoch.
status: HTTP status returned by server or 0 if document has not been indexed yet.
crc: MD5 checksum of document.
last_modified: "Last-Modified" field in the HTTP header.
etag: "ETag" field in the HTTP header.
last_index_time: Time of last indexing in seconds from UNIX epoch.
referrer: ID of URL which first referred this URL.
tag: Arbitrary tag.
hops: Depth of URL in hyperlink tree.
redir: ID of the URL to which this URL is redirected, or 0 if this URL is not redirected.
origin: Set to 0 for the original, 1 for a clone.
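
As an illustration of how the status and next_index_time columns can be read together, here is a minimal Python sketch. The function name needs_indexing and the selection rule itself (index a URL if it has never been indexed or if its next indexing time has passed) are assumptions made for this example, not a description of what locust actually does. A row is represented as a plain dict keyed by column name.

```python
import time


def needs_indexing(row, now=None):
    """Return True if a urlword row looks due for (re)indexing.

    Assumed rule (not confirmed by locust itself): a URL is due if it has
    never been indexed (status == 0) or if next_index_time has passed.
    """
    if now is None:
        now = int(time.time())  # seconds from UNIX epoch, like the columns
    return row["status"] == 0 or row["next_index_time"] <= now
```

For example, needs_indexing({"status": 200, "next_index_time": 1000}, now=2000) is true, because the scheduled time is already in the past.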

urlwordsNN (where NN is a 2-digit number from 00 to 15)

These tables contain additional information about indexed URLs. The number NN in the table name is URL_ID mod 16.
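
The mapping from a URL ID to its table can be sketched in Python as follows. The helper name urlwords_table is hypothetical; only the "URL ID mod 16, printed as two digits" rule comes from the description above.

```python
def urlwords_table(url_id, tables=16):
    """Return the name of the urlwordsNN table that holds the given URL ID."""
    if url_id < 0:
        raise ValueError("url_id must be non-negative")
    # NN is the URL ID modulo the number of tables, zero-padded to 2 digits.
    return "urlwords%02d" % (url_id % tables)
```

With this rule, URL ID 5 maps to urlwords05 and URL ID 31 maps to urlwords15.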

deleted: Set to 1 if server returned an error.
wordcount: Count of unique words in the indexed part of URL.
totalcount: Total count of words in the indexed part of URL.
content_type: Content-Type HTTP header returned by server.
charset: Document charset taken from Content-Type HTTP header or META.
title: First 128 characters of the page title.
txt: First 255 characters of the page body, stripped of HTML tags.
docsize: Total size of URL.
keywords: First 255 characters from page keywords.
description: First 100 characters from page description.
lang: Not used now.
words: Zipped content of URL.
hrefs: Sorted array of outgoing href IDs from this URL.

The first 4 bytes of the blob field words (the size of an unsigned integer) hold the size of the uncompressed document content, or the value 0xFFFFFFFF if the content is stored uncompressed. The latter may happen if the compressed content is longer than the uncompressed one or if compression fails. The rest of the blob contains the compressed or uncompressed content.
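
The layout of the words blob can be sketched in Python as below. Two details are assumptions not stated here: that the 4-byte header is little-endian (the actual byte order is presumably that of the machine that wrote the blob), and that "zipped" means the zlib format. The function names encode_words and decode_words are hypothetical.

```python
import struct
import zlib

UNCOMPRESSED_MARKER = 0xFFFFFFFF  # header value meaning "payload is raw"


def decode_words(blob):
    """Return the document content stored in a words blob."""
    if len(blob) < 4:
        raise ValueError("blob is shorter than the 4-byte header")
    # Assumption: little-endian unsigned 32-bit header.
    (size,) = struct.unpack_from("<I", blob, 0)
    payload = blob[4:]
    if size == UNCOMPRESSED_MARKER:
        return payload
    # Assumption: the payload is a zlib stream.
    content = zlib.decompress(payload)
    if len(content) != size:
        raise ValueError("uncompressed size does not match the header")
    return content


def encode_words(content):
    """Build a words blob, falling back to raw storage when needed."""
    try:
        compressed = zlib.compress(content)
    except zlib.error:
        compressed = None
    # Store raw if compression failed or did not make the content shorter.
    if compressed is None or len(compressed) > len(content):
        return struct.pack("<I", UNCOMPRESSED_MARKER) + content
    return struct.pack("<I", len(content)) + compressed
```

A very short document usually takes the raw path, since the zlib overhead makes the compressed form longer than the original.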

robots

This table contains information parsed from the robots.txt file of each site.

hostinfo: Host name.
path: Path to exclude from indexing.
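
A minimal Python sketch of how the path values for one host might be applied when deciding whether to index a URL. It assumes simple prefix matching, as in the original robots.txt convention; whether locust matches paths exactly this way is not stated here, and the function name is_excluded is hypothetical.

```python
def is_excluded(url_path, excluded_paths):
    """Return True if url_path starts with any excluded path of the host.

    Assumption: plain prefix matching, with empty entries ignored.
    """
    return any(url_path.startswith(p) for p in excluded_paths if p)
```

For example, with the excluded paths "/cgi-bin/" and "/tmp/", the path "/cgi-bin/search" is excluded while "/index.html" is not.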

sites

This table contains IDs for all indexed sites.

site_id: ID of site.
site: Site name with protocol, like http://www.my.com/.

stat

This table contains statistics for each completed query.

addr: IP address of the computer from which the query was requested.
proxy: IP address of the proxy server through which the query was requested.
query: Query string.
ul: URL limit used to restrict the query.
sp: Web spaces used to restrict the query.
site: Site ID used to restrict the query.
np: Results page number requested.
ps: Results per page.
sites: Number of found sites matching query.
urls: Number of found URLs matching query.
start: Query processing start in seconds from UNIX epoch.
finish: Query processing finish in seconds from UNIX epoch.
referer: URL of web page from which query was requested.
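
Because start and finish are both in seconds from the UNIX epoch, the processing time of a query is simply their difference. The Python sketch below shows one possible use of the table, listing the slowest queries; rows are represented as plain dicts keyed by column name, and both function names are hypothetical.

```python
def query_duration(row):
    """Return the processing time of one stat row, in seconds."""
    return row["finish"] - row["start"]


def slowest_queries(rows, limit=10):
    """Return up to `limit` stat rows, longest processing time first."""
    return sorted(rows, key=query_duration, reverse=True)[:limit]
```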